The problem I will be addressing is whether factors such as private/public status, number of applications, number of students accepted, number of enrollments, high school class rank, out-of-state tuition, room and board, book costs, percentage of professors with PhDs, student-to-faculty ratio, and expenditure per student can predict a university's graduation rate.
University administrators would be interested to know which factors affect whether students graduate from their university and how they might increase the number who do.
This approach will determine which of the above variables have a significant impact, build multiple models to predict university graduation rates, and calculate the accuracy of each model to determine the best one to use.
Download Location: https://www.kaggle.com/yashgpt/us-college-data
library(e1071)
library(dplyr)
library(caret)
library(purrr)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(tidyverse)
library(randomForest)
college_data <- read.csv("College_Data.csv")
head(college_data)
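# Encode Private as an integer: "No" -> 1, "Yes" -> 2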
college_data$Private <- as.integer(ifelse(college_data$Private=="No", 1, 2))
# Round S.F.Ratio and convert to integer
college_data$S.F.Ratio <- round(college_data$S.F.Ratio)
college_data$S.F.Ratio <- as.integer(college_data$S.F.Ratio)
rownames(college_data) <- college_data$Uni.Name
college_data <- within(college_data, rm(Uni.Name))
options(repr.plot.width=16, repr.plot.height=12)
par(mfrow=c(2,3))
for(i in colnames(college_data)){
  variable <- as.numeric(unlist(college_data[i]))
  collegeHists <- hist(variable, main = paste(i, "Histogram", sep=" "), xlab = i)
  # Label each bar with its count
  text(collegeHists$mids, collegeHists$counts, labels=collegeHists$counts, adj=c(0.5, -0.5))
  # Overlay a normal curve scaled to the histogram counts
  xfit <- seq(min(variable), max(variable), length.out = nrow(college_data))
  yfit <- dnorm(xfit, mean = mean(variable), sd = sd(variable))
  yfit <- yfit * diff(collegeHists$mids[1:2]) * length(variable)
  lines(xfit, yfit, col = "blue", lwd = 2)
}
# Initialize data frame
Value_MinMax <- as.data.frame(cbind("temp" = c(0, 1)))
# For each variable, compute the bounds three standard deviations
# above and below the mean; values outside will be treated as outliers
for(i in colnames(college_data)){
  variable <- as.numeric(unlist(college_data[i]))
  Value_MinMax <- cbind(Value_MinMax,
                        i = c(round(mean(variable) + sd(variable) * 3),
                              round(mean(variable) - sd(variable) * 3)))
}
# Remove temp column for data frame initialization
Value_MinMax <- subset(Value_MinMax, select=-c(temp))
# Apply proper column names for the variables
colnames(Value_MinMax) <- colnames(college_data)
# Apply row names
rownames(Value_MinMax) <- c("Max Accepted Value", "Min Accepted Value")
Value_MinMax
clean_data <- college_data %>% rownames_to_column('uni') %>%
filter(Private %in% (Value_MinMax[[1]][1]:Value_MinMax[[1]][2]),
Apps %in% (Value_MinMax[[2]][1]:Value_MinMax[[2]][2]),
Accept %in% (Value_MinMax[[3]][1]:Value_MinMax[[3]][2]),
Enroll %in% (Value_MinMax[[4]][1]:Value_MinMax[[4]][2]),
Top10perc %in% (Value_MinMax[[5]][1]:Value_MinMax[[5]][2]),
Top25perc %in% (Value_MinMax[[6]][1]:Value_MinMax[[6]][2]),
F.Undergrad %in% (Value_MinMax[[7]][1]:Value_MinMax[[7]][2]),
P.Undergrad %in% (Value_MinMax[[8]][1]:Value_MinMax[[8]][2]),
Outstate %in% (Value_MinMax[[9]][1]:Value_MinMax[[9]][2]),
Room.Board %in% (Value_MinMax[[10]][1]:Value_MinMax[[10]][2]),
Books %in% (Value_MinMax[[11]][1]:Value_MinMax[[11]][2]),
Personal %in% (Value_MinMax[[12]][1]:Value_MinMax[[12]][2]),
PhD %in% (Value_MinMax[[13]][1]:Value_MinMax[[13]][2]),
Terminal %in% (Value_MinMax[[14]][1]:Value_MinMax[[14]][2]),
S.F.Ratio %in% (Value_MinMax[[15]][1]:Value_MinMax[[15]][2]),
perc.alumni %in% (Value_MinMax[[16]][1]:Value_MinMax[[16]][2]),
Expend %in% (Value_MinMax[[17]][1]:Value_MinMax[[17]][2]),
Grad.Rate %in% (Value_MinMax[[18]][1]:Value_MinMax[[18]][2])) %>%
column_to_rownames('uni')
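As an aside, this +/- 3 standard deviation filter can be written far more compactly. Below is a minimal sketch, assuming dplyr >= 1.0 for if_all(); within_3sd is a helper I introduce here, and because the bounds are not rounded to integers (and the row-name bookkeeping is skipped), its result can differ slightly from the filter above.
# Keep only rows where every column lies within 3 standard deviations of its mean
within_3sd <- function(x) x >= mean(x) - 3 * sd(x) & x <= mean(x) + 3 * sd(x)
clean_data_alt <- college_data %>% filter(if_all(everything(), within_3sd))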
# Structure of cleaned dataset
paste0('Cleaned Dataset Structure')
str(clean_data)
# Cleaned Dataset
paste0('Cleaned Dataset Head')
head(clean_data)
options(repr.plot.width=16, repr.plot.height=12)
par(mfrow=c(2,3))
for(i in colnames(clean_data)){
  variable <- as.numeric(unlist(clean_data[i]))
  cleanHists <- hist(variable, main = paste(i, "Histogram", sep=" "), xlab = i)
  # Label each bar with its count
  text(cleanHists$mids, cleanHists$counts, labels=cleanHists$counts, adj=c(0.5, -0.5))
  # Overlay a normal curve scaled to the histogram counts
  xfit <- seq(min(variable), max(variable), length.out = nrow(clean_data))
  yfit <- dnorm(xfit, mean = mean(variable), sd = sd(variable))
  yfit <- yfit * diff(cleanHists$mids[1:2]) * length(variable)
  lines(xfit, yfit, col = "blue", lwd = 2)
}
We can also see that of the original 777 records, only 92 were removed, leaving us with 685 records for this project's analysis.
paste0(nrow(college_data) - nrow(clean_data), ' Records Removed')
Now that we have the final dataset, we can dive deeper into how the variables correlate with each other. The best way I could think to do this was to create a correlation heatmap of all of the variables in the dataset to see which ones have the highest correlation with the graduation rate (Grad.Rate).
# Function to reorder a correlation matrix by hierarchical clustering
reorder_cormat <- function(cormat){
  # Use (1 - correlation) / 2 as the distance between variables
  dd <- as.dist((1 - cormat) / 2)
  hc <- hclust(dd)
  cormat[hc$order, hc$order]
}
cormat <- round(cor(clean_data), 2)
# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
# Melt the correlation matrix
melted_cormat <- melt(cormat)
# Create the correlation matrix heatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() + # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 15, hjust = 1)) +
coord_fixed()
options(repr.plot.width=14, repr.plot.height=14)
par(mfrow=c(1,1))
# Add title and correlation values to squares
ggheatmap + labs(title = "Correlation Matrix Heatmap") +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
text = element_text(size = 20),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank())
Looking at the heatmap, we can see that the variable with the highest correlation with Grad.Rate is Outstate at 0.54, with perc.alumni coming in close at 0.48. Neither of these is a strong correlation with Grad.Rate, and the rest of the variables fare far worse.
Before making any more assumptions, I decided to run linear models of Grad.Rate against each of the other variables to determine their R^2 values and calculate the RMSE. These values, combined with the heatmap, will give me a better idea of how the variables relate to Grad.Rate and whether any of them are significant.
Calculate the RMSE and R^2 values and store them in a data frame for easy plotting.
# RMSE calculation Function
RMSE <- function(error) { sqrt(mean(error^2)) }
# Initialize Dataframe to hold R^2 and RMSE values
RRMSE_df <- as.data.frame(rbind("temp" = c(0, 1, 2, 4)))
# All variable names except the response itself
nameList <- setdiff(names(clean_data), "Grad.Rate")
# Loop to fit a one-variable linear model for each variable, with Grad.Rate as the dependent variable
for (x in nameList){
  # Create the model Grad.Rate ~ x
  model <- lm(reformulate(x, response = "Grad.Rate"), clean_data)
  # Append the variable name, R^2, adjusted R^2, and RMSE to the data frame
  RRMSE_df <- rbind(RRMSE_df, i = c(x, summary(model)$r.squared,
                                    summary(model)$adj.r.squared,
                                    RMSE(model$residuals)))
}
# Remove "temp" record from dataframe initialization
RRMSE_df <- RRMSE_df[-1,]
# Apply column names to new dataframe
colnames(RRMSE_df) <- c("Name", "R^2", "Adjusted R^2", "RMSE")
# Round each column's values to 4 decimal places
RRMSE_df$`R^2` <- round(as.numeric(RRMSE_df$`R^2`), 4)
RRMSE_df$`Adjusted R^2` <- round(as.numeric(RRMSE_df$`Adjusted R^2`), 4)
RRMSE_df$RMSE <- round(as.numeric(RRMSE_df$RMSE), 4)
# Reorder R^2 column from least to greatest value
RRMSE_df <- RRMSE_df[order(RRMSE_df$`R^2`), drop = FALSE, ]
# Apply factor levels to prevent ggplot from reordering variable values
RRMSE_df$Name <- factor(RRMSE_df$Name, levels=unique(as.character(RRMSE_df$Name)))
RRMSE_df <- transform(RRMSE_df, Name=reorder(Name, `R^2`))
options(repr.plot.width=14, repr.plot.height=10)
par(mfrow=c(1,1))
# R^2 Graph from Models
ggplot(RRMSE_df, aes(R.2, Name)) +
labs(title = "R^2 of each model with Grad.Rate as the Indepedent Variable",
x = "R^2", y = "Variable Name") +
scale_x_continuous(breaks = round(seq(min(RRMSE_df$R.2), max(RRMSE_df$R.2), by = 0.02), 2)) +
geom_point() +
theme(text = element_text(size = 20))
We are looking for the variable with the highest R^2, and from this graph we can see that Outstate has the highest R^2 at about 0.29.
This is not a good R^2 value, but we need to keep in mind that each of these models used only one variable; further modeling with multiple variables still needs to be done. This was simply to get an idea of how each of the variables fares on its own against Grad.Rate.
options(repr.plot.width=14, repr.plot.height=10)
par(mfrow=c(1,1))
# Reorder RMSE column from least to greatest value
RRMSE_df <- RRMSE_df[order(RRMSE_df$RMSE), drop = FALSE, ]
# Apply factor levels to prevent ggplot from reordering variable values
RRMSE_df$Name <- factor(RRMSE_df$Name, levels=unique(as.character(RRMSE_df$Name)))
RRMSE_df <- transform(RRMSE_df, Name=reorder(Name, RMSE))
# RMSE graph, computed from each model's residuals
ggplot(RRMSE_df, aes(RMSE, Name)) +
labs(title = "RMSE of each model$residuals with Grad.Rate as the Indepedent Variable",
x = "RMSE", y = "Variable Name") +
scale_x_continuous(breaks = round(seq(min(RRMSE_df$RMSE), max(RRMSE_df$RMSE), by = 0.2), 1)) +
geom_point() +
theme(text = element_text(size = 20))
We are looking for the variable with the lowest RMSE, and from this graph we can see that Outstate has the lowest at about 13.7. An RMSE value means little on its own; it needs to be compared against the RMSEs of the other models, which is why I graphed those as well, and it is what allows us to determine that Outstate has the lowest RMSE. Ideally we are looking for the model with the highest R^2 and the lowest RMSE, which right now appears to be Grad.Rate ~ Outstate.
Based on my analysis so far, this is not a good model for predicting university graduation rates, and my preliminary conclusion is that, individually, these variables are not good predictors. My next step is to start combining multiple variables into models to see if I can improve those scores before moving on to other machine learning algorithms for further analysis.
The next step is to start building models to find the significant variables and hopefully create better models as a result of what we know so far. I have created four linear models below, and we will look at each of them to see the process I went through to arrive at the one I selected.
clean_lm1 <- lm(Grad.Rate ~ ., data = clean_data)
# Summary of the model
summary(clean_lm1)
We are looking for all of the variables with a p-value less than 0.05, so to make them easier to see I have filtered out all the variables that do not meet that criterion.
This list shows us all the variables that are significant in this model; in the next model we will use only these variables, then repeat the process until we have a desired model.
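As an aside, this kind of backward elimination can also be automated. Below is a minimal sketch using base R's step(), which drops terms by AIC rather than by p-values, so it may settle on a different model than the manual process that follows.
# Automated backward elimination on AIC (trace = 0 suppresses the step-by-step log)
auto_lm <- step(lm(Grad.Rate ~ ., data = clean_data), direction = "backward", trace = 0)
summary(auto_lm)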
p_values <- data.frame(cbind("P.value" = summary(clean_lm1)$coef[summary(clean_lm1)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
clean_lm2 <- lm(Grad.Rate ~ perc.alumni + Expend + Outstate + Room.Board + Top25perc +
Apps + Private + Personal + PhD + P.Undergrad, clean_data)
summary(clean_lm2)
p_values <- data.frame(cbind("P.value" = summary(clean_lm2)$coef[summary(clean_lm2)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
We can see from this model's new p-value list that PhD is no longer significant. As a result, we can drop it from the next model. We can check the improvement of each of these models using anova().
anova(clean_lm1, clean_lm2)
Ideally we want the Pr(>F) value to be less than 0.05, so while the model improved, it did not improve enough to reject the null hypothesis and be significant.
clean_lm3 <- lm(Grad.Rate ~ perc.alumni + Expend + Outstate + Room.Board +
Top25perc + Apps + Private + Personal + P.Undergrad, clean_data)
summary(clean_lm3)
p_values <- data.frame(cbind("P.value" = summary(clean_lm3)$coef[summary(clean_lm3)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
anova(clean_lm1, clean_lm2, clean_lm3)
From the anova() output, the third model improved considerably over the second, but the Pr(>F) value is still too high to reject the null hypothesis. So, going back to the third model's p-value list, we will drop the variables with the largest p-values.
To speed things up, I will drop the Room.Board, Personal, and P.Undergrad variables.
clean_lm4 <- lm(Grad.Rate ~ perc.alumni + Expend + Outstate + Top25perc + Apps + Private, clean_data)
summary(clean_lm4)
p_values <- data.frame(cbind("P.value" = summary(clean_lm4)$coef[summary(clean_lm4)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
anova(clean_lm1, clean_lm2, clean_lm3, clean_lm4)
We can see from the anova() output that the fourth model improved significantly over the previous models, with a Pr(>F) value well below 0.05. With this value we can say that the model rejects the null hypothesis. Based on all the information I have been able to deduce from my data, this is the best model I can come up with, and it will be used for the remainder of the analysis.
options(repr.plot.width=16, repr.plot.height=10)
models <- paste0("clean_lm", 1:4)
models_df <- as.data.frame(cbind("Name" = as.character(models),
"R^2" = sapply(models, function(x)round(summary(get(x))$r.squared, 4)),
"RMSE" = sapply(models, function(x)round(RMSE(get(x)$residuals), 4))))
# Reorder R^2 column from least to greatest value
models_df <- models_df[order(models_df$`R^2`), drop = FALSE, ]
# Apply factor levels to prevent ggplot from reordering variable values
models_df$Name <- factor(models_df$Name, levels=unique(as.character(models_df$Name)))
models_df <- transform(models_df, Name=reorder(Name, as.numeric(`R^2`)))
rsquared_Plot <- ggplot(models_df, aes(R.2, Name)) +
labs(title = "R^2 of each of the Four models",
x = "R^2", y = "Models") +
geom_point(size=3) +
theme(text = element_text(size = 20))
# Reorder RMSE column from greatest to least value
models_df <- models_df[order(models_df$RMSE, decreasing = TRUE), drop = FALSE, ]
# Apply factor levels to prevent ggplot from reordering variable values
models_df$Name <- factor(models_df$Name, levels=unique(as.character(models_df$Name)))
models_df <- transform(models_df, Name=reorder(Name, as.numeric(RMSE)))
# RMSE calculation graph from model$residuals
RMSE_Plot <- ggplot(models_df, aes(RMSE, Name)) +
labs(title = "RMSE of each of the Four models",
x = "RMSE", y = "Models") +
geom_point(size=3) +
theme(text = element_text(size = 20))
grid.arrange(rsquared_Plot, RMSE_Plot)
While the fourth model does not have the largest R^2 or the smallest RMSE compared to the other models, the anova() function tells us that it shows the best improvement over the previous models.
We can also take a look at the models in the diagnostic plots below. The main one we want to look at is the Normal Q-Q plot in the top right corner; we want the grouping of points to be as close to the dotted line as possible.
options(repr.plot.width=16, repr.plot.height=10)
# Function to draw the four diagnostic plots for a given linear model
plot_func <- function(x){
  par(mfrow=c(2,2))
  plot(get(x))
  title(x, line = -2, outer = TRUE)
}
# Plot the diagnostic graphs for each model (base graphics)
invisible(sapply(models, plot_func))
As you look through the diagnostic plots, the grouping of points on the Normal Q-Q plot gets closer and closer to the line, with the last model (clean_lm4) having the points closest to it. All of these plots still show skew in the right-hand tail, but the fourth model is the best fit of them all.
For comparison's sake, I will run the same tests with the first model (using all variables) to compare against the results of the fourth model.
Now that I have the models I want to use, I will apply some machine learning algorithms to start training and testing my data predictions and to find the accuracy of my models.
First I will find the accuracy of my linear models' predictions, and then I will move on to randomForest and k-nearest neighbors (KNN) models.
# The models I will be using
model1 <- (Grad.Rate ~ .)
model4 <- (Grad.Rate ~ perc.alumni + Expend + Outstate + Top25perc + Apps + Private)
# First split the data into train and test sets:
# train will have 80% and test will have 20% of the data
# Set a seed so the partition is reproducible (the seed value is arbitrary)
set.seed(123)
in_train <- createDataPartition(clean_data$Grad.Rate, p = 0.8, list = FALSE)
# Create train and test datasets; Grad.Rate stays numeric since this is a regression problem
train <- clean_data[in_train,]
test <- clean_data[-in_train,]
# Train our first model, lm1
clean_lm1 <- train(model1, method = "lm", data = train)
# Train our fourth model, lm4
clean_lm4 <- train(model4, method = "lm", data = train)
# Add RMSEs and R^2 values to the results data frame
RMSE_data <- rbind("clean_lm1" = cbind(clean_lm1$results["RMSE"], clean_lm1$results["Rsquared"]),
"clean_lm4" = cbind(clean_lm4$results["RMSE"], clean_lm4$results["Rsquared"]))
# clean_lm1's Output
print(clean_lm1)
# clean_lm4's Output
print(clean_lm4)
From the results of just the linear models so far, it looks like the fourth model fared just a bit better than the first model with all of the variables, which is good, because it means that the variables I removed from the first model did in fact improve the fourth model.
From each model's output, we are looking at the RMSE and R^2 values. The RMSE is how far, on average, the predicted value is from the actual value, so the closer this value is to 0 the better. The bigger the R^2 value the better, with 1 being the best, because it means that the independent variables explain more of the variance in the dependent variable. We will need to complete all the other models and compare all the values together before we can draw any conclusions from their results.
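As a sanity check, we can also score the held-out test set directly. Below is a minimal sketch using caret's postResample(), which returns the RMSE, R^2, and MAE of the predictions against the observed values; it assumes the train/test split created above.
# Score the fourth model's predictions on the 20% test split
test_pred <- predict(clean_lm4, newdata = test)
postResample(pred = test_pred, obs = test$Grad.Rate)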
# Train the first randomForest model (caret's default bootstrap resampling)
clean_rf1 <- train(model1, method = "rf", data = train)
# Train the fourth randomForest model
clean_rf4 <- train(model4, method = "rf", data = train)
# Add RMSEs and R^2 values to the results data frame
RMSE_data <- rbind(RMSE_data,
"clean_rf1" = cbind("RMSE" = clean_rf1$results["RMSE"][[1]][1],
"Rsquared" = clean_rf1$results["Rsquared"][[1]][1]),
"clean_rf4" = cbind("RMSE" = clean_rf4$results["RMSE"][[1]][1],
"Rsquared" = clean_rf4$results["Rsquared"][[1]][1]))
# clean_rf1's Output
print(clean_rf1)
# clean_rf4's Output
print(clean_rf4)
The randomForest model works by tuning mtry, the number of predictors sampled at each split, to get the best RMSE and R^2. From the model output we see three candidate values under the resampling results. Out of all of them, the first has the best RMSE and R^2, so it will be recorded and used to measure this model's effectiveness against the other models.
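If we wanted more control over this tuning, caret accepts an explicit tuning grid and resampling scheme. Below is a minimal sketch with illustrative mtry values and 5-fold cross-validation; this is not what was run above, just an example of how the tuning could be directed.
# Tune mtry explicitly over a small grid with 5-fold cross-validation
clean_rf_cv <- train(model4, method = "rf", data = train,
                     trControl = trainControl(method = "cv", number = 5),
                     tuneGrid = expand.grid(mtry = c(2, 4, 6)))
# Extract the row caret selected as best
clean_rf_cv$results[clean_rf_cv$results$mtry == clean_rf_cv$bestTune$mtry, ]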
# Train the first knn model (again with caret's default resampling)
clean_knn1 <- train(model1, method = "knn", data = train)
# Train the fourth knn model
clean_knn4 <- train(model4, method = "knn", data = train)
# Add RMSEs and R^2 values to the results data frame
RMSE_data <- rbind(RMSE_data,
"clean_knn1" = cbind("RMSE" = clean_knn1$results["RMSE"][[1]][1],
"Rsquared" = clean_knn1$results["Rsquared"][[1]][1]),
"clean_knn4" = cbind("RMSE" = clean_knn4$results["RMSE"][[1]][1],
"Rsquared" = clean_knn4$results["Rsquared"][[1]][1]))
# clean_knn1's Output
print(clean_knn1)
# clean_knn4's Output
print(clean_knn4)
The KNN model works by finding the number of neighbors (k) that gives the best RMSE and R^2 for the data. From the model output we see three candidate values of k under the resampling results. Out of all of them, the third has the best RMSE and R^2, so it will be recorded and used to measure this model's effectiveness against the other models.
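Because KNN is distance-based, it usually benefits from centering and scaling the predictors, which caret can do inline. Below is a minimal sketch with an illustrative grid of k values; again, this is not what was run above.
# KNN with standardized predictors and an explicit grid of k values
clean_knn_scaled <- train(model4, method = "knn", data = train,
                          preProcess = c("center", "scale"),
                          tuneGrid = data.frame(k = c(5, 7, 9, 11)))
print(clean_knn_scaled)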
With the analysis finished, we can move on to answering the research questions and drawing final conclusions about this dataset.
options(repr.plot.width=16, repr.plot.height=6)
par(mfrow=c(1,1))
for (i in c(1:ncol(RMSE_data))){
  plot(RMSE_data[[i]], main = paste(colnames(RMSE_data[i]), " of all of the models", sep=""),
       xaxt = "n", ylab = colnames(RMSE_data[i]),
       xlab = "Models", col="lightblue", pch=19, cex=6)
  axis(1, at=1:6, labels=rownames(RMSE_data))
  # Vertical lines separate the linear, randomForest, and KNN model pairs
  abline(v=c(2.5, 4.5))
  # Add values to points for easier viewing
  text(RMSE_data[[i]], labels=round(RMSE_data[[i]], 3), cex=0.8, font=2)
}
Overall, looking at the graphs above, the model with the lowest RMSE and the highest R^2 is the first randomForest model (clean_rf1). The randomForest models fared about the same as (just slightly better than) the linear models, and, curiously, the KNN models fared far worse than both. However, the R^2 is still very low, and the RMSE means the model can only predict a university's Grad.Rate to within roughly that margin. In terms of a graduation rate, that is quite a large margin, so this model should probably not be used to predict a university's graduation rate.
First Model P-Values
# First linear model with all variables
p_values <- data.frame(cbind("P.value" = summary(clean_lm1)$coef[summary(clean_lm1)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
Fourth Model P-Values
# Fourth linear model
p_values <- data.frame(cbind("P.value" = summary(clean_lm4)$coef[summary(clean_lm4)$coef[,4] <= .05, 4]))
# Order P-values from least to greatest
p_values[order(p_values), , drop = FALSE]
From the lists above we can see that for the first model the variables perc.alumni, Expend, Outstate, Top25perc, Private, Personal, Room.Board, and Apps were all significant.
Since the fourth model performed better than the first one, I am inclined to say that the variables listed as significant in the fourth model have a bigger impact on the data than those in the first: perc.alumni, Top25perc, Outstate, Apps, Expend, and Private.
options(repr.plot.width=16, repr.plot.height=10)
par(mfrow=c(2,2))
for(i in c(30, 50, 70, 90)){
  rate <- clean_data %>% rownames_to_column('uni') %>%
    filter(Grad.Rate >= i) %>%
    column_to_rownames('uni')
  tempHist <- hist(rate$Grad.Rate, main = paste("Grad.Rate >= ", i, "% (Total: ", nrow(rate), ")", sep=""),
                   xlab = "Graduation Rate (%)")
  text(tempHist$mids, tempHist$counts, labels=tempHist$counts, adj=c(0.5, -0.5))
}
# Copy dataframe
newPred <- clean_data
# Set the PhD variable (percent of faculty with PhDs) to 120 for all universities
newPred$PhD <- 120
# Make predictions using clean_lm4 because it had the best RMSE and R^2
pred <- as.data.frame(cbind("Grad.Rate" = round(predict(clean_lm4, newdata = newPred), 0)))
# Total Universities
print(paste0('Total Universities: ', sum(table(pred$Grad.Rate > clean_data$Grad.Rate))))
# Comparison of the grad rates
print('Comparison of Grad Rates')
table(pred$Grad.Rate > clean_data$Grad.Rate)
From the comparison we can see a fairly even split between universities whose predicted graduation rate increased and those whose rate decreased. However, it is important to note that PhD had a very small impact on the models during our analysis, and running the same prediction with the unmodified clean dataset gives the same results. As a result, with this model I am leaning towards the conclusion that the percentage of professors with PhDs does not have a significant influence on the graduation rate.
# Get list of private universities for prediction and clean datasets
privated <- clean_data %>% filter(Private == 2)
clean_priv <- clean_data %>% filter(Private == 2)
# Get list of public universities for prediction and clean datasets
public <- clean_data %>% filter(Private == 1)
clean_publ <- clean_data %>% filter(Private == 1)
# Change private schools to public
privated$Private <- 1
# Change public schools to private
public$Private <- 2
# Make predictions using clean_lm4 because it had the best RMSE and R^2
# Private to Public prediction
pred_Priv <- as.data.frame(cbind("Grad.Rate" = round(predict(clean_lm4, newdata = privated), 0)))
# Public to Private Prediction
pred_Publ <- as.data.frame(cbind("Grad.Rate" = round(predict(clean_lm4, newdata = public), 0)))
# Comparison of grad rates to original dataset
# Total Universities that were originally private
paste0('Total Universities that were originally private: ', sum(table(pred_Priv$Grad.Rate > clean_priv$Grad.Rate)))
# Private to Public comparison
print('Private to Public Comparison')
table(pred_Priv$Grad.Rate > clean_priv$Grad.Rate)
# Total Universities that were originally public
print('Total universities that were originally public')
sum(table(pred_Publ$Grad.Rate > clean_publ$Grad.Rate))
# Public to Private comparison
print('Public to Private Comparison')
table(pred_Publ$Grad.Rate > clean_publ$Grad.Rate)
From the first comparison (Private to Public) of the 516 originally private universities, about 32% saw an increase in their Grad.Rate and the rest (about 68%) either saw a decrease or stayed the same. Based on this, I would say that if a private university decided to become public, it could be assumed that it would experience a decrease in its graduation rate.
From the second comparison (Public to Private) of the 169 originally public universities, about 67% saw an increase in their Grad.Rate and the rest (about 33%) either saw a decrease or stayed the same. Based on this, I would say that if a public university decided to become private, it could be assumed that it would experience an increase in its graduation rate.
However, we still need to keep in mind that these results are based on a model with a wide error margin and should not be taken too confidently.
In conclusion, based on all of the analysis, I am confident in saying that this dataset does not have enough data, or the right variables measured, to predict the graduation rate of these universities to a confident degree. The models' RMSE and R^2 values are not good enough to use them for predicting university graduation rates. I would recommend that more data be collected, and that other variables be measured as well, to get a better spread of data and more accurate models.
This could mean that the variables measured in this dataset do not actually have an effect on the graduation rate, and thus university administrations do not need to spend more time and/or resources improving them to boost their graduation rates. However, as I said above, I believe more data, and possibly more variable measurements, need to be collected before agreeing with that statement confidently.
This dataset is limited to a single year of data collection, and I think it would benefit greatly from multiple years of data from these universities; a far more accurate model could then be trained to predict their graduation rates. I also think that one way to improve this dataset, and by extension the models derived from it, would be to collect other data about each university as well, such as the number of students who graduate, the number of students who drop out in each school year (freshman, sophomore, junior, senior), and possibly student demographic data (to get a better idea of the makeup of the student population).